Fits, and especially linear fits, with errors on both axes, 
extra variance of the data points and other complications 

G. D'Agostini 
Universita "La Sapienza" and INFN, Rome, Italy 
(giulio.dagostini@romal .infn.it, http : //www, romal . inf n. it/~dagos|) 



Abstract 

The aim of this paper, triggered by some discussions in the astrophysics community 
raised by astro-ph/0508529, is to introduce the issue of 'fits' from a probabihstic perspective 
(also known as Bayesian) , with special attention to the construction of model that describes 
the 'network of dependences' (a Bayesian network) that connects experimental observations 
to model parameters and upon which the probabilistic inference relies. The particular case 
of linear fit with errors on both axes and extra variance of the data points around the 
straight line (i.e. not accounted by the experimental errors) is shown in detail. Some 
questions related to the use of linear fit formulas to log-linearized exponential and power 
laws are also sketched, as well as the issue of systematic errors. 



Preamble 

This paper, based on things already written somewhere with the addition of some details 
from lectures, contains nothing or little especially new. Even the main 'result', summarized in 
Ea.(|S^ and that I hope will contribute to set down the questions raised by astro-ph/0508529] P, 



is just a simple extension of Eq. (8.33) of Ref. [2j. Therefore the debated question could be 
dismissed with a paper even shorter than astro-ph/0508529. Nevertheless, I have taken the 
opportunity to reorganize old material for the benefit of my students, and I post these pages 
hoping they could be of some utility to those who wish to understand what there is behind 
formulas. 
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1 Introduction 



A common task in data analysis is to 'determine', on the basis of experimental observations, 
the values of the parameters of a model that relates physical quantities. This procedure is 
usually associated to names like 'fit' and 'regression', and to principles, like 'least squares' 
or 'maximum likelihood' (with variants). I prefer, as many others belonging to a still small 
minority, to approach the problem from more fundamental probabilistic 'first principles', that 
are indeed the fundamental rules of probability theory. This approach is also called 'Bayesian' 
because of the central role played by Bayes' theorem in the process of learning from data, 
as we shall see in a while (for a critical introduction to the Bayesian approach see Ref. 
and references therein). In practice this means that we rank in probability hypotheses and 
numerical values about which we are not certain. This is rather intuitive and it is indeed 
the natural way physicists reason (see e.g. Ref. and references therein), though we have 
been taught a peculiar view of probability that does not allow us to make the reasonings we 
intuitively do and that we are going to use here. 

In the so called Bayesian approach the issue of 'fits' takes the name of parametric inference, 
in the sense we are interested in inferring the parameters of a model that relates 'true' values. 
The outcome of the inference is an uncertain knowledge of parameters, whose possible values 
are ranked using the language and the tools of probability theory. As it can only be (see e.g. 
Ref. for extensive discussions), the resulting inference depends on the inferential model 
and on previous knowledge about the possible values the model parameters can take (though 
this last dependence is usually rather weak if the inference is based on a 'large' number of 
observations). It is then important to state clearly the several assumptions that enter the data 
analysis. I hope this paper does it with the due care - and I apologize in advance for some 
pedantry and repetitions. The main message I would like to convey is that nowadays it is 
much more important to build up the model that describes at best the physics case than to 
obtain simple formulae for the 'best estimates' and their uncertainty. This is because, thanks 
to the extraordinary progresses of applied mathematics and computing power, in most cases 
the calculation of the integrals that come from a straight application of the probability theory 
does not require any longer titanic efforts. Building up the correct model is then equivalent, 
in most cases, to have solved the problem. 

The paper is organized as follows. In Section |21 the inferential approach is introduced from 
scratch, only assuming the multivariate extensions of the following well known formulas^ 



We show how to build the general model, and how this evolves as soon as the several hypotheses 
of the model are introduced (independence, normal error functions, linear dependence between 

^The meaning of the overall conditioning I will be clarified later. Note that, in order to simplify the notation, 
the generic symbol /( ) is used to indicate all probability density functions, though they might refer to different 
variables and have different mathematical expressions. In particular, the order of the arguments is irrelevant, 
in the sense that f{x,y\I) stands for 'joint probability density function of x and y under condition /', and 
therefore it could be also indicated by f{y,x \ I). For the same reason, the indexes of sums and products and 
the extremes of the integrals are usually omitted, implying they extend to all possible values of the variables. 



f{x,y\I) 
f{x\I) 




f{x\y,I)-f{y\I) 



(1) 
(2) 



2 



true values, vague priors). The graphical representation of the model in terms of the so called 
'Bayesian networks' is also shown, the utility of which will become self-evident. The case of 
linear fit with errors on both axes is then summarized in Section|31 and the approximate solution 
for the non-linear case is sketched in Section ^ The extra variability of the data is modeled 
in Section [3 first in general and then in the simple case of the linear fit. The interpretation 
of the inferential result is discussed in Section |HJ in which approximated methods to calculate 
the fit summaries (expected values and variance of the parameters) are shown. Finally, some 
comments on the not-trivial issues related to the use of linear fit formulas to infer the parameters 
of exponential and power laws are given in Section [7| Section [H] shows how to extend the 
model to include systematic errors, and some simple formulas to take into account offset and 
scale systematic errors in the case of linear fits will be provided. The paper ends with some 
conclusions and some comments about the debate that has triggered it. 



2 Probabilistic parametric inference from a set of data points 
with errors on both axes 

Let us consider a 'law' that relates the 'true' values of two quantities, indicated here by and 
fly: 

= Hyiflx; 0) , (3) 

where stands for the parameters of the law, whose number is M. In the linear case Eq. ^ 
reduces to 

liy = mux + c (4) 

i.e. 6 = {m, c} and M = 2. As it is well understood, because of 'errors' we do not observe 
directly and fiy, but experimental quantities^ x and y that might differ, on an event by 
event basis, from ^x and fiy. The outcome of the 'observation' (see footnote 2) Xi for a given 
fixi (analogous reasonings apply to yi and fiy-) is modeled by an error function f{xi \ fixi,!), 
that is indeed a probability density function (pdf) conditioned by ^Xi and the 'general state 
of knowledge' I. The latter stands for all background knowledge behind the analysis, that is 
what for example makes us to believe the relation ^y = iiy{^x] G), the particular mathematical 
expressions for f{xi\fixi,I) and f (y^ \ fj,y- , I) , and so on. Note that the shape of the error 
function might depend on the value of /x^,.., as it happens if the detector does not respond the 
same way to different solicitations. A usual assumption is that errors are normally distributed, 
i.e. 

Xi ~ M{iJ.xi,crx,) (5) 
Vi ~ ■^i^J■y^,cry^) , (6) 

where the symbol '~' stands for 'is described by the distribution' (or 'follows the distribution'), 
and where we still leave the possibility that the standard deviations, that we consider known, 

^These quantities might also be summaries of the data. I.e. they are either directly observed numbers, like 
readings on scales, or quantities calculated from direct observations, like averages or other 'statistics' based on 
partial analysis of the data. It is implicit that when summaries are used, instead of direct observations, the 
analyzer is somewhat relying on the so called 'statistical sufficiency'. 
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might be different in different observations. Anyway, for sake of generality, we sliall make use 
of assumptions © and © only in next section. 

If we think of N pairs of measurements of fi^ and fiy, before doing the experiment we are 
uncertain about 4N quantities (all x's, all y's, all Hx^s and all %'s, indicated respectively as x, 
y, fj,^ and Hy) plus the number of parameters, i.e. in total 4A^+M, that become 4A^+2 in linear 
fits. [But note that, due to believed deterministic relationship @, the number of independent 
variables is in fact 3A^ + M.] Our final goal, expressed in probabilistic terms, is to get the pdf 
of the parameters given the experimental information and all background knowledge: 

=^f{0\x,y,I) f{m,c\x,y,I) for linear fits ] . 

Probability theory teaches us how to get the conditional pdf f{0 \ x, y, I) if we know the joint 
distribution f{x, y, /x^., fiy, 6\I). The first step consists in calculating the 2N + M variable pdf 
(only + M of which are independent) that describes the uncertainty of what is not precisely 
known, given what it is (plus all background knowledge). This is achieved by a multivariate 
extension of Eq. 



f{x,y,fi^,Hy,e\I) 
f{x,y\I) 



/(/^x>/^y>^|aj,y,/) = _ I (7) 



f{x,y,ti^,fiy,e\I) 



I fix, y, fJ-x, fJ'y, 1 1) dfXy de 



(8) 



Equations ((TJ and © are two different ways of writing Bayes' theorem in the case of multiple 
inference. Going from to (jHI we have 'marginalized' /(a;, j/, /i^., /x^, | /) over /x^, /i^ and 
0, i.e. we used an extension of Eq. @ to many variables. [The standard text book version 
of the Bayes formula differs from Eqs. ((TJ and because the joint pdf's that appear on the 
r.h.s. of Eqs. O-® are usually factorized using the so called 'chain rule', i.e. an extension of 
Eq. (0 to many variables.] 

The second step consists in marginalizing the {2N + M)-dimensional pdf over the variables 
we are not interested to: 

f{e\x,y,I) = J f{tix,tJ-y,0\x,y,I) dfi^dfiy (9) 

Before doing that, we note that the denominator of the r.h.s. of Eqs. ((T))-® is just a number, 
once the model and the set of observations {x, y} is defined, and then we can absorb it in the 
normalization constant. Therefore Eq. © can be simply rewritten as 

f{6\x,y,I) oc J f{x,y,fix^t^y,^\I) dfi^dfJ-y. (10) 

We understand then that, essentially, we need to set up f{x,y,fi^,Hy,6\I) using the pieces 
of information that come from our background knowledge /. This seems a horrible task, but 
it becomes feasible tanks to the chain rule of probability theory, that allows us to rewrite 
f{x, y, /x^, fly, 0\I) in the following way: 
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■f{y\ tJ-x,tJ-y,^J) 

■f{fiy\fi^,e,i) 

■f{o\i) (11) 

(Obviously, among the several possible ones, we choose the factorization that matches our 
knowledge about of physics case.) At this point let us make the inventory of the ingredients, 
stressing their effective conditions and making use of independence, when it holds. 

• Each observation depends directly only on the corresponding true value '■ 

f{x\y,fi^,Hy,e,I) = f{x\ti^,I) = Y[f{xi\nx^,I) (12) 

i 

[^n-^('"-o^xj]- (13) 

i 

(In square brackets is the 'routinely' used pdf.) 

• Each observation yi depends directly only on the corresponding true value fiy. : 

f{y\tj-x,fj-y,d,i) = fivliJ-yJ) = Y[fiyi\^J■y^J) (i4) 

i 

[^ll^{f^y.,^yj]- (15) 

i 

• Each true value fiy depends only, and in a deterministic way, on the corresponding true 
value fix and on the parameters 6. This is formally equivalent to take an infinitely sharp 
distribution of /i^. around /iy(/i^.;0), i.e. a Dirac delta function: 

fi/J-y\t^x,d,I) = Y[S[fiy^- Hy{nx,,0)] (16) 

i 

[^Y[S{fiy^-mfix,-c)] (17) 

i 

• Finally, fixi and 9 are usually independent and become the priors of the problem,^ that 
one takes 'vague' enough, unless physical motivations suggest to do otherwise. For the Hxi 

Priors need to be specified for tfie nodes of a Bayesian network that have no parents (see Fig^and footnote 4). 
Priors are logically necessary ingredients, without which probabilistic inference is simply impossible. I understand 
that those who approach this kind of reasoning for the first time might be scared of this 'subjective ingredient', 
and because of it they might prefer methods advertised as 'objective' to which they are used, formally not 
depending on priors. However, if one thinks a bit deeper to the question, one realizes that behind the slogan of 
'objectivity' there is much arbitrariness, of which the users are often not aware, and that might lead to seriously 
wrong results in critical problems. Instead, the Bayesian approach offers the logical tool to properly blend 
prior judgment and empirical evidence. For further comments see Ref. where it is shown with theoretical 
arguments and many examples what is the role of priors, when they can be 'neglected' (never logically! - but 
almost always in routine data analysis), and even when they are so crucial that it is better to refrain from 
providing probabilistic conclusions. 
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[ for each i ] 



Figure 1: Graphical representation of the model in term of a Bayesian network (see text). 

we take immediately uniform distributions over a large domain (a 'flat prior'). Instead, 
we leave here the expression of f{6 \ I) undefined, as a reminder for critical problems (e.g. 
one of the parameter is positively defined because of its physical meaning) , though it can 
also be taken flat in routine applications with 'many' data points. 

/(/xj0,/)./(0|/) = f{^lJI)■f{e\I) (18) 
= k,fio\i) (19) 

The constant value of /(/^^ | /), indicated here by kx, is then in practice absorbed in the 
normalization constant. 

In conclusion we have 

(20) 

= Yi^xJixilux,,!) ■ fiVilfJ-y,,!) ■ S[fiy^- fiy{iJ,x,,0)] ■ f{0\I) (21) 

i 

« Y[fiXi\t^X,,l) ■ f{yi\f^y„l) ■ S[fly^ - fly{fix„0)] ■ f{6\l) . (22) 

i 

Figure ^ provides a graphical representation of the model [or, more precisely, a graphical 
representation of Eq. (|2U() ]. In this diagram the probabilistic connections are indicated by solid 
lines and the deterministic connections by dashed lines. These kind of networks of probabilistic 
and deterministic relations among uncertain quantities is known as 'Bayesian network',^ 'belief 

* According to Wtkipedia^, a Bayesian network "is a directed graph of nodes representing variables and arcs 
representing dependence relations among the variables. If there is an arc from node A to another node B, then we 
say that A is a parent of B. If a node has a known value, it is said to be an evidence node. A node can represent 
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network', 'influence network', 'causal network' and other names meaning substantially the same 
thing. From Eqs. (jlUj) and H22|l we get then 



f{e\x,y,I) oc 



oc fix, y\e,i)- fie I /) = cie ■,x,y)- fie \ i) 



f{o\i) 

(23) 
(24) 



where we have factorized the unnormalized 'final' pdf into the 'likelihood'^ C,id ] x,y) (the 
content of the large square bracket) and the 'prior' /(0 | /). 

We see than that, a part from the prior, the result is essentially given by the product of 
terms, each of which depending on the individual pair of measurements: 



fiO\x,y,I) 



oc 



]j£i(0; Xi,yi,I) 



(25) 



where 



CiiO ; Xi,yi) = fiXi,yi\e,I) = k^^ J /(Xi | /i^;,, /)• /(t/j | %,,/)• - %(/Xa:,, 0) ] 



kx, / fixi\nx,,l)-fiyi\fiyifix,,0),l)dfi^^ 



(26) 
(27) 



any kind of variable, be it an observed measurement, a parameter, a latent variable, or a hypothesis. Nodes are 
not restricted to representing random variables; this is what is "Bayesian" about a Bayesian network." [Note: 
here "random variable" stands for a random variable in the frequentistic acceptation of the term ('a la von Mises' 
randomness) and not just as 'variable of uncertain value'.] Bayesian networks represent both a conceptual and 
a practical tool to tackle complex inferential problems. They have indeed renewed the interest in the field of 
artificial intelligence, where they are used in inferential engines, expert systems and decision makers. Browsing 
the web you will find plenty of applications. Here just a few references: Ref. is a well known tutorial; Ref. |S| 
and |7| and good general books on the subject, the first of which is related to the HUGIN software, a lite 
version of it can be freely downloaded (Sj; for a fiash introduction to the issue, with the possibility of starting 
playing with Bayesian network on discrete problems JavaBayes is recommended, for which I have worked 
also a couple of examples in |1U|: for discrete and continuous variables that can be modeled with well known 
pdf, a good starting point is BUGS for which I have worked out some examples concerning uncertainties 
in measurements BUGS stands for Bayesian inference Using Gibbs Sampling. This means the relevant 

integrals we shall see later are performed by sampling, i.e. using Markov chain Monte Carlo (MCMC) methods. 
I do not try to introduce them here, and I suggest to look elsewhere. Good starting point can be the BUGS web 
page J3 and Ref. |13|. 

^Traditionally the name 'likelihood' is given to the probability of the data given the parameters, i.e. 
f{x, y\0,I), seen as a mathematical function of the parameters. Therefore the notation C{6 ; x, y) [not to be 
confused with f{6 \ x, y)\]. f{x, y\6,I) can be obtained marginalizing f{x, y, fi^,fiy \ 9, 1), i.e. f{x, y\0,I) = 
f f{x,y,n^,liy\0,l)dn^dtx^, where /(x, y, /i^, /i^ | 0, /) = f{x,y,n^,iiy,0\I)/f{6\I) is obtained from 
Eq. (Uni- It follows: 

.f{x,y,fJ-^,tJ'y\I) = 5[/iy, -fip-xAI) 



and 



f{x,y\e,I) = j W_}{xi\ii^i,I) ■ f{yi\^j.y„I) ■ 5[iiy^ - ^iy{fi^^,e)] ■ fipx, \I) dfi^^dfiy 
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and the constant factor /c^,, , irrelevant in the Bayes formula, is a reminder of the priors about 
fixi (see footnote 5). 

3 Linear fit with normal errors on both axes 

To apply the general formulas of the previous section we only need to make explicit fj,y- (/x^. . , 6) 
and the error functions, and finally integrate over /i^;.. In the case of linear fit with normal 
errors the individual contributions to the likelihoods become 



Ci{m,c; Xi,yi) 



kxi 



1 



27r (Tt 



exp 



{xi - ^lx,f 



2ai 



exp 



1 



27r o", 



exp 



24 



that, inserted into Eq. (|25jl . finally give 



f{m,c\x,y,I) oc J| 



exp 



{Ui -mxj- cf 

{Ui -mxj- c)2 



d^xi 
(28) 
(29) 



f{m,c\I) . 



(30) 



The effect of the error of the x-values is to have an effective standard error on the y- values that 
is the quadratic combination of ay and ax, the latter 'propagated' to the other coordinate via 
the slope m (this result can be justified heuristically by dimensional analysis). 



4 Approximated solution for non-linear fits with normal errors 

Linearity implies that the arguments of the exponential of the integrand in Eq. H28|) contains 
only first and second powers of fixi, and then the integrals has a closed solution. Though this 
is not true in general, the linear case teaches us how to get an approximated solution of the 
problem. We can take first order expansions of ^y{fj,x, 0) around each Xi 

Hyifix^d) ~ Hy{xi;6) + fj,y{xi;e) ■ {fix, - Xi) . (31) 

The difference yi — m fixi — c in Eq. PH|) . that was indeed equal to yi — /J-yifJ-Xi ', ^) in the general 
case, using the linear approximation becomes 

yi - Hy{xi;0) - Hyixi] 6) ■ {fix, - Xi) = yi - fj,y{xi;e) ■ fix, - [l^y{xi; 0) - fiy{xi]e) • Xj] , 
i.e. we have the following replacements in Eqs. (|28|) -(j3fl |) : 

m fi'y{xi;e) (32) 
c fiy{xi;6) - fiy{xi;6) ■ Xi . (33) 

The approximated equivalent of Eq. (|Hn|l is then 



fie\x,y,I) a« n 



exp 



[yi - fiy{xi;6)]' 



2[al+f,l\x,;e).al]\ 



fiO\I), (34) 



where the unusual symbol 'oc~' stands for 'approximately proportional to'. 
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5 Extra variability of the data 



As clearly stated, the previous results assume that the only sources of deviation of the mea- 
surements from the value of the physical quantities are normal errors, with known standard 
deviations a^i and Oy^ . Sometimes, as it is the case of the data points reported in Ref. [Tl| . 
this is not the case. This means that y depends also on other, 'hidden' variables, and what we 
observe is the overall effects integrated over all the variability of the variables that we do not 
'see'. In lack of more detailed information, the simplest modification to the model described 
above is to add an extra Gaussian 'noise' on one of the coordinates. For tradition and sim- 
plicity this extra noise is added to the y variable. The effect on the above result can be easily 
understood. Let us call the r.m.s. of this extra noise that acts normally and independently 
in each y point. As it is well known, the sum of Gaussian distributions is still Gaussian with an 
expected value and variance respectively sum of the individual expected values and variances. 
Therefore, the effect in the individual likelihoods H28() is to replace a^. by -|- a^. But we 
now have an extra parameter in the model, and Eq. (|30|) becomes 



f{m,c,a^ I x,y,I) 



oc 



n 



exp 



m Xi 



c? 



2 {al + al + 



2^2 



a. 



f{m,c,ay 1 1) 



(35) 



More rigorously, this formula can be obtained from a variation of reasoning followed in the 
previous section. 



fly depends on and on the set of hidden variables v: 



(36) 
(37) 



where the overall dependence iJ.y'\ ) has been split in two functions: z{fj,x,f^), only de- 
pending on Hx and the model parameters, corresponding to the ideal case; g{iix,v) de- 
scribing the difference from the ideal case. 

• Calling z the fictitious variable, deterministically dependent on Hx, for a given fixi we 
have the following model 



z{P'x,,0) : f{zi\fi^^,e,I) = 6[zi- z{fix,,d) 
■ fifJ■y^\zi,I) 



(38) 
(39) 



where /(/iy, | Zi,I) describes our uncertainty about fiy. due to the unknown values of all 
other hidden variables. 

We need now to specify finy^ \ Zi,I). As usual, in lack of better knowledge, we take 
a Gaussian distribution of unknown parameter a^, with awareness that this is just a 
convenient, approximated way to quantify our uncertainty. 
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[ for each i ] 



Figure 2: Minimal modification of Fig. to model the extra variability not described by the 
error functions. Note that stands for all model parameters to be inferred, including ay. 
Instead, stands for all parameters apart from . 

At this point a summary of all ingredients of the model in the specific case of linear model 
is in order: 



Vi 






(40) 


Xi 






(41) 




^ mux^ + c [ 


=> 5{zi - m + c)] 


(42) 




~ N'{zi,ay) 




(43) 




~ U{—(X),+(X)) 




(44) 


m, c, Gy 


=^ see later 


'uniform'], 


(45) 



where ZY(— oo,+cxd) stands for a uniform distribution over a very large interval, and the 
symbol ' has been used to deterministically assign a value, as done in BUGS fCTj (see 
later) . 

• We have now the extra parameter that we include in 0, so that M increases by 1. The 
new model in represented in Fig. |21 in which we have indicated by 0/av all parameters 
apart from a^. 

• The variables of the model are now 5A^ + M, and Eq. I|22() becomes 

f(.x,y,Hx,Hy,z,e\I) oc Y[f{xi\fix,,I) ■ f{yi\fJ-ii,,I) 

i 

■fit,y^\z„I)- 5[z,-z{i^,^,e)] ■f{e\I). (46) 

• Consequently, Eq. (|10|) becomes 

f{e\x,y,I) oc J f{x,y,fi^,fiy,z,e\I) dfi^dfiydz . (47) 
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Inserting the model functions (|4Ujl - (|45jl in Eq. (|46j) . after the marginahzation (|47|) and 
the factorization of the result into likelihood as prior [as previously done in (|24|l]. we get 
the analogues of Eqs. 

Ci{G] Xi,yi) 



f{xi\fix^,I) ■ fivi I • fifiy^ \zi,9,I) ■ 6[zi - z{fix^,6)] dn^AiXyAzi 

(48) 

j f{xi I ^lx,,I) ■ fiVi I fJ-y,,!) ■ I z{nx,,0),6,I) dfix.dfiy^ 



1 



1 



exp 



2al. 



27r (T,j 



exp 



24 



2-K 
1 



exp 



27r (Tt 



exp 



2al 



dflxidflyi 



1 



1 



exp 



\/2^ Jcr2 + a2 +mV2 



exp 



2(ct2 + 4) 

(/i^- -mxj- c)2 
'2(ct2 + 4 + mV2^ 



(49) 



(50) 



(51) 
(52) 



Inserting in Eq. (|25j) the expression of £i(0 ; Xi,yi) coming from Eq. (|52|) we get finally 
Eq. (ESJ. 



6 Computational issues: normalization, fit summaries, priors 
and approximations 

At this point it is important to understand that in Bayesian approach the full result of the 
inference is given by final distribution, that in our case is - we rewrite it here: 



f{m,c,ay I x,y,I) 



exp 



{Vi -mxj- c)2 



f{m,c,ay I /) , 
(53) 



where k is 'simply' a normalization factor. (This factor is usually the most difficult thing to 
calculate and it is often obtained approximately by numerical methods. But this is, in principle, 
just a technical issue.) Once we have got k we have a full knowledge about f{m, c, \ x, y, I) 
and therefore about our uncertainty concerning the model parameters, the distribution of each 
of which can be obtained by marginahzation: 



f{m\x,y,I) = J f{m,c,av\x,y,I) dcda^ 



(54) 
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f{c\x,y,I) = J f{m,c,av\x,y,I) drnda^ 
f{ay\x,y,I) = J f{m,c,ay\x,yj) dmdc. 
Similarly the joint distribution of m and c can be obtained as 

f{ni,c\x,y,I) = Jf{m,c,a^\x,y,I)dav, 



(55) 
(56) 

(57) 



from which we can easily see that we recover Eq. (|30|) in the case we think the extra variability 
discussed in the previous section is absent. This limit case corresponds to a prior of sharply 
peaked around zero, i.e. f{ay \ I) = S{ay). 

Other interesting limit cases are the following. 

• Errors only on the y axis and no extra variability. 

Making the limit of Eq. ()3U() for a^^ and neglecting irrelevant factors we get 



f{m,c\x,y,I) (X Y[ex.p 



{Vi -mxj- cf 



oc exp 



1 (yi - mxj - c)2 



2T 



f{m,c\I) 
f{m,c\I) 



(58) 
(59) 



This is the best known and best understood case. 

Errors only on the y axis and extra variability. 
Making the limit of Eq. (jSHJ for u^- 



f{m,c,av\x,y,I) oc J| 



1 



exp 



{yj -mXj- 
2(^2 + 4) 



f{m,c,av\I). (60) 



Scattering of data point around the hypothesized straight line only due to 'extra vari- 
ability'. 



f{m,c,ay\x,y,I) oc a^^J^exp 



{yi -mxi- c)" 



2a2 



-N 

OC (T„ exp 



2a2 ^ 



m Xi 



f{m,c,ay\I) (61) 
fim,c,ay\I). (62) 



This case corresponds to the joint determination of m, c and made by the method of the 
'residuals', that can be considered a kind of approximated solution of Eq. (|6H) . achieved 
by iteration. [Indeed, if there are 'enough' data points the 'best estimates' achieved by 
the residual method are very close to the expected values of m, c and ay evaluated from 
/(m, c, ay I X, y, I) if we assumed a flat prior distribution for the parameters.] 
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Although, as it has been pointed out above, the full result of the inference is provided by the 
final pdf, often we do not need such a detailed description of our uncertainty, and we are only 
interested to provide some 'summaries'. The most interesting ones are the expected values, 
standard deviations and correlation coefficients, i.e. E{m), E{c), E(ay), a"(m), cr{c), cr{av), 
p{m, c), p{m, a^) and p{c, a^). They are evaluated from /(m, c, a"„) using their definitions, that 
are assumed to be known [hereon we often omit the conditions on which the pdf depends, 
and we write f{m,c,ay) instead of fim^c^a^ |a;,y,/), and so on]. Obviously, these are not 
the only possible summaries. One might report in addition the mode or the median of each 
variable, one-dimensional or multi-dimensional probability regions [i.e. regions in the space 
of the parameters that are believed to contain the true value of the parameter(s) with a well 
defined probability level], and so on. It all depends on how standard or unusual the shape 
of f{m,c,(Ty) is. I just would like to stress that the most important summaries are expected 
value, standard deviation and correlation coefficients, because these are the quantities that 
mostly matter in subsequent evaluations of uncertainty. Giving only 'most probable' values 
and probability intervals might bias the results of further analyzes |15| . 

The prior f{m, c, \ I) has been left on purpose open in the above formulas, although we 
have already anticipated that usually a flat prior about all parameters gives the correct result 
in most 'healthy' cases, characterized by a sufficient number of data points. I cannot go here 
through an extensive discussion about the issue of the priors, often criticized as the weak point 
of the Bayesian approach and that are in reality one of its points of force. I refer to more 
extensive discussions available elsewhere (see e.g. and references therein), giving here only 
a couple of advices. A flat prior is in most times a good starting point (unless one uses some 
packages, like BUGS 11 , that does not like flat prior in the range —oo to -|-oo; in this case 
one can mimic it with a very broad distribution, like a Gaussian with very large a). If the 
result of the inference 'does not offend your physics sensitivity', it means that, essentially, flat 
priors have done a good job and it is not worth fooling around with more sophisticated ones. 
In the specific case we are looking closer, that of Eq. (|53|) . the most critical quantity to watch 
is obviously a^, because it is positively defined. If, starting from a fiat prior (also allowing 
negative values), the data constrain the value of in a (positive) region far from zero, and 
- in practice consequently - its marginal distribution is approximatively Gaussian, it means 
the fiat prior was a reasonable choice. Otherwise, the next-to-simple modeling of cr^ is via the 
step function 9{ay). A more technical choice would be a gamma distribution, with suitable 
parameters to 'easily' accommodate all envisaged values of o"t,. 

The easiest case, that happens very often if one has 'many' data points (where 'many' 
might be already as few as some dozens), is that f{m,c,av) obtained starting from flat priors 
is approximately a multi-variate Gaussian distribution, i.e. each marginal is approximately 
Gaussian. In this case the expected value of each variable is close to its mode, that, since 
the prior was a constant, corresponds to the value for which the likelihood £(m,c, cj^ ; x,y) 
gets its maximum. Therefore the parameter estimates derived by the maximum likelihood 
principle are very good approximations of the expected values of the parameters calculated 
directly from f{m,c,av). In a certain sense the maximum likelihood principle best estimates 
are recovered as a special case that holds under particular conditions (many data points and 
vague priors). If either condition fails, the result the formulas derived from such a principle 
might be incorrect. This is the reason I dislike unneeded principles of this kind, once we have 
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a more general framework, of which the methods obtained by 'principles' are just special cases 
under well defined conditions. 

The simple case in which f{m,c,ay) is approximately multi-variate Gaussian allows also 
to approximately evaluate the covariance matrix of the fit parameters from the Hessian of 
its logarithm.^ This is due to a well known property of the multi-variate Gaussian and it is 
not strictly related to flat priors. In fact it can easily proved that if the generic f{9) is a 
multivariate Gaussian, then 



(63) 

6 = 9m 



where 

if {6) = -log/(0), (64) 

Vij{6) is the covariance matrix of the parameters and 6^ is the value for which f{9) gets its 
maximum and then f{9) its minimum. 

An interesting feature of this approximated procedure is that, since it is based on the 
logarithm of the pdf, normalization factors are irrelevant. In particular, if the priors are flat, 
the relevant summaries of the inference can be obtained from the logarithm of the likelihood, 
stripped of all irrelevant factors (that become additive constants in the logarithm and vanish 
in the derivatives). Let us write down, for some cases of interest, the minus-log-likelihoods, 
stripped of constant terms and indicated by L, i.e. f{0 ; x,y) = L{6 ; x,y) + const. 

• Simplest case: linear fit with only known errors on the y axis [from Eq. H58|) ]: 

I -sr-^ {Ui — m Xi — c)'^ 1 2/ 
L{m,c;x,y) = ^ = - x [m, c; x,y) , (65) 

where we recognize the famous chi-squared. Applying Eq. H63|) we get then the covariance 
matrix of the fit parameters as 



_ ld'^x^{m,c; x,y) 

\<' )m,c 



2 dm dc 



(66) 

m = rUm 



(See Ref. )^ for the fully developed example yielding analytic formulas for the expected 
values and covariance matrix of the m and c.) Note that the often used (but also often 
misused! ^S|) '-^X^ = 1 rule' to calculate the covariance matrix of the parameters comes 
from the same Gaussian approximation of the final pdf and prior insensitivity. [And, 
because of the factor 1/2 between Eqs. (|63|) and (|66|) . there is an equivalent 'A minus- 
log-likelihood = 1/2' rule, applicable under the same conditions]. 



®I would like to point out that I added the formulas that follow just for the benefit of the inventory. Personally, 
in such low dimensional problems I find it easier to perform numerical integrations than to evaluate, obviously 
with the help of some software, derivatives, find minima and invert matrices, or to use the 'Ax^ = 1' or 
'A minus- log-likelihood = 1/2' rules. Moreover, I think that the lazy use of computer programs solely based 
on some approximations produces the bad habit of taking acritically their results, even when they make no 
sense [TF|. Nevertheless, with some reluctance and after these warnings, I give here the formulas that follows, 
and that the reader might know as derived from other ways, hoping he/she understands better how they can be 
framed in a more general scheme, and therefore when it is possible to use them. 
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• Errors also on the y axis: 



Lim,c;x,y) = ^ E ^"g K + O + ^ E f • (67) 

In this case expected values and covariance matrix cannot be obtained directly in closed 
form. Nevertheless, one can use iteratively the formulas for = in which the estimate 
of m is used to evaluate the terms a^.. (having the meaning of effective y-error) in 

the likelihood of the next iteration. Instead it is wrong to simply replace the denominator 
of the of Eq. H65() with cjy . + m? a^- , because this approximation does not take into 
account the first term of the r.h.s. of Eq. (|67j) and the slope m will be underestimated 
(as a consequence, the intercept c will be over- or under-estimated, depending on the sign 
of the correlation coefficient between m and c, a sign that depends on the sign of the 
barycenter of the x points.) 

• Dispersion on the y axis only due to ct„ [from Eq. (|HT|) ]: 

L{m,c,ay\ x,y) = N log + -^^{yi - mxi - cf . (68) 

• The most complete case seen here [from Eq. (|53|) ]: 

L{m,c,a^; x,y) = - ^ log (a^ + + m a^J + - ^ -2^-^^— 2^ . (69) 

i i ^ Vi 

• As the previous item, but for the general fiy{ ) [from Eq. (|34|)]: 

Tfa \ ^ \^ ^ r 2 I 2 , /2/ a\ 2i , ^ [Ui ~ t^yi^ii ^)]'^ 

L{e,a,-x,y) « )_^log[a,+a^^+/z, {x,-,e) • a,^] + . 

(70) 

7 From power law to linear fit 

Linear fits are not only used to infer the parameters of a linear model, but also of other 
models that are linearized via a suitable transformation of the variables. The best known 
cases are the exponential law, linearized taking the log of the ordinate, and the power low, 
linearized taking the log of both coordinates. Linearizion is particularly important to provide 
a visual evidence in support of the claimed model. However, quantitative inference based on 
the transformed variable is not so obvious, if high accuracy in the determination of the model 
parameters is desired. Let us make some comments on the power law, in which both variables 
are log-transformed and therefore more general. 
We start hypothesizing a model 

B = kA^, (71) 
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that is linearized as 



log 5 



7 log A + log K . 



(72) 



We identify then log B with fiy of the linear case, log A with Hx, 7 with m and log k. with c. But 
this identification does not allows us yet to use tout court the formulas derived above, because 
each of them depends on a well defined model. Let us see where are the possible problems. 

• In the simplest model Oj is normally distributed around Ai and bi around Bi (we indicate 
by a and b the set of observations in the original variables). But, in general, Xi = 
logflj and Hi = logftj are not normally distributed around f^x^ = log A and fiy. = logB, 
respectively. They are only when the measurements are very precise, i.e. Cai/ai <C 1 and 
Cbi/bi ^ 1. This the case in which standard 'error propagation', based on the well known 
formulas base on linearization, holds. 

• If the precision is not very high, i.e. Ca./aj and (Tf,./6j are not very small, non-linear 
effects in the transformations could be important (see e.g. Ref. jl5j). 

• When some of dajai and (Thjbi approach unity it becomes important to consider the 
error functions and the priors about A and B with the due care. For example, very often 
the quantities A and B are defined positive - and if we take their logarithms, they have 
to be positive. This requires the model to be correctly set up in order to prevent negative 
values of A and B. 

Further considerations would require a good knowledge of the the experimental apparatus and 
of the physics under study. Therefore I refrain from indicating a toy model, that could be used 
acritically in serious applications. Instead I encourage to draw a graphical representation of the 
model, as done in Figs. ^ and [2 and to make the inventory of the ingredients. Sometimes the 
representation in terms of Bayesian network is almost equivalent to solve the problem, thanks 
also to the methods developed in the past decades to calculate the relevant integrals, using 
e.g. Markov Chain Monte Carlo (MCMC), see e.g. Ref. ^Hl and references therein. In case of 
simple models one can even use free available software, like BUGS 

8 Systematic errors 

Let us now consider the effect of systematic errors, i.e. errors that acts the same way on 
all observations of the sample, for example an uncertain offset in the instrument scale, or an 
uncertain scale factor. I do not want to give a complete treatment of the subjects, but focus 
only on how our systematic effects modify our graphical model, and give some practical rules for 
the simple case of linear fits. (For an introduction about systematic errors and their consistent 
treatment within the Bayesian approach see Ref. Pj.) 

For each coordinate we can introduce the fictitious quantities /if and that take into 
account the modification of and fiy due to the systematic effect. For example, if the system- 
atic effects only acts as an offset, i.e. we are uncertain about the 'true' zero of the instruments, 
Qx and C,y, we have 




,x 



(73) 
(74) 
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Figure 3: Graphical model of Fig. \^ with the addition of systematic errors on both axes. 



where the true value of are unknown (otherwise there would be no systematic errors). We 
only know that their expected value is zero (otherwise we need to apply a calibration constant 
to the measurements) and we quantify our uncertainty with pdf's. For example, we could 
model them with Gaussian distributions: 



Cx 



■AA(0,a^J 



(75) 
(76) 



Anyway, for sake of generality, we leave the systematic effects in the most general form, depen- 
dent on the uncertain quantities (3^ and Py [to be clear: in the case of solely offset systematics 



we have = {(x} = {Cy}]- The values of /Uj 



Px 

Py 



and are modeled as follow 

i^x ii^xi ; Px) 



(77) 
(78) 
(79) 
(80) 



Px-fiPx\I) 

Py^fi(3y\I). 

Figure El shows the graphical model containing the new ingredients. The links — > Xi and 
Py — > Vi are to remember that systematics could also effect the error functions. An alternative 
visual picture of the probabilistic model is shown in Fig. ^ Note the different symbols to 
indicate the different uncertain processes: the divergent arrows (in yellow, if you are reading 
an electronic version of the paper) indicate that, given a value of the 'parent' variable, the 
'child' variable fluctuates on an event-by-event basis; the green single arrow with the question 
mark indicate that, given a value of the 'parent', the child will always take a fixed value, though 
we do not know which one. 
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Figure 4: A different visual representation of tlie probabilistic model of Fig. 

Obviously, the practical implementation of complicate systematic effects in complicate fits 
can be quite challenging, but at least the Bayesian network provides an overall picture of 
the model. The simplest case is that of linear fit where only offset and scale uncertainty are 
present, with uncertainty modeled by a Gaussian distribution. This means that the /3's and 
their uncertainty are as follows (ry is the scale factor of uncertain value): 

Px = {Cx,Vx} f3y = {Cy,Vy} (81) 

G~AA(0,acJ Cj;~AA(0,a^J (82) 

77^ ~ AA(1, a^J rjy ~ AA(1, a^J (83) 

In this case we can get an hint of how the uncertainty about m and c change without doing 
the full calculation following an heuristic approach, valid when f{m,c) is approximately mul- 
tivariate Gaussian and the details of which can be found in Ref. JHI • We obtain the following 
results, in which a{m)\^^ indicates the contribution to the uncertainty about the slope m due 
to uncertainty about Cx, <7(m)| that due to the scale factor rjx, and so on'': 



a{m)\^^ = (84) 

a(m)|^^ = (85) 

f^(c)|^^ = k|(T^, (86) 

^(c)lc„ = ^C. (87) 



^In Ref. ^S] is indicated by Zx, rjx by fx, and so on. 
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a{m) 




= kl (Trix 


(88) 


a{m) 


% 


= kk^. 


(89) 


a{c) 




= 


(90) 


a{c) 






(91) 



All contributions are then added quadratically to the so called 'statistical' ones. 

9 Conclusions 

The issue of fits has been approached from probability first principles, i.e. using throughout 
the rules of probability theory, without external ad hoc ingredients. It has been that the main 
task consists in building up the inferential model, that means in fact to properly factorize 
the joint probability density function of all variables of the problem. We have seen that this 
factorization, based on the so called chain rule of probability theory, has a very convenient 
graphical representation, that takes the name of Bayesian (or belief /causal/infiuence) network. 
Modeling the problem in terms of such networks not only helps to understand the problem 
better, but, thanks the huge amount of mathematical developments relates to them, it becomes 
the only way to get a (numerical) solution when problems get complicated. 

We have also seen how to recover well known formulas, obtained starting from other ap- 
proaches, under well defined conditions, thus indicating that other methods can be seen as 
approximations of the most general one, and that are therefore applicable if the conditions of 
validity hold. 

The linear case with errors on both axis and extra variance of the data has been shown 
with quite some detail, giving un-normalized formulas for the pdf. In particular, going to the 
pretext to write this paper, we can see that Eq. (43) of Ref. ^\ is not reproduced. In fact, 
if I understand it correctly, that equation should have the same meaning of Eq. (|53j) of this 
paper. However, Eq. (43) of Ref. J7j contains an extra factor \/l + m? (using the notation 
of this paper), that it is a bit odd, for several reasons (besides the fact that I do not get it 
- but this could be judged a technical argument by the hurry reader). The first reason is 
just dimensionality: mx \s homogeneous with y and for this reason max can be combined 
(quadratically) to ay, but m? cannot be added tout court to 1. The second is that if there was 
such a factor in Eq. H53() . then one cannot reproduce Eqs. (|Hn|) and H61() . that one can 

be obtained in simpler ways (and that give rise to the likelihoods shown in Section IHl some of 
them rather well known). Note that the addition of a term Vl + in Eq. H53|) has the net 
effect of overestimating m, an effect that is consistent with the claim by 1_ of a slope larger 
than that obtained by P^.^ 

*As a rule of thumb, since the extra variance of the data of |14) is rather important, the slope has to be very 
close to that obtained neglecting all a^^ and (Ty. and making a very simple least square regression. 
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